Re: Detect string has non-ASCII chars without checking each char?

Vlastimil Brom

Aug 21, 2010, 4:21:10 PM

to pytho...@python.org

2010/8/21 <pyt...@bdurham.com>:
> Python 2.6: Is there a built-in way to check if a Unicode string has
> non-ASCII chars without having to check each char in the string?
>
> Here's my use case: I have a section of code that makes frequent calls to
> hasattr. The attribute name being tested is derived from incoming data which
> at times can contain international content.
>
> hasattr() raises an exception when passed a Unicode attribute name. I would
> have expected a simple True/False return value vs. an encoding error.
>
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u012c' in
> position 0: ordinal not in range(128)
>
> Is this behavior by design or could I be encoding the string I'm passing
> hasattr() incorrectly?
>
> If its by design, I'm thinking the best approach for me would be to write a
> hasattr_enhanced() function that traps the Unicode encoding exception and
> returns False and use this function in place of hasattr(). Any thoughts on
> this strategy?
>
> Thank you,
> Malcolm
>
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>
>
Hi,
I can't comment on the use case itself, but to check whether a unicode
string is plain ASCII one can perhaps use a simple hack (I'm not sure about
possible drawbacks ...).
It most likely still goes through all the characters internally, but maybe
in a more straightforward way...

>>> a = u"abc"
>>> b = u"abc\u012c"
>>> a.encode("ascii", "ignore").decode("ascii") == a
True
>>> b.encode("ascii", "ignore").decode("ascii") == b
False
>>>

Others may supply more general/elegant/... approaches.

vbr

John Nagle

Aug 22, 2010, 1:40:07 AM

On 8/21/2010 1:21 PM, Vlastimil Brom wrote:
> 2010/8/21<pyt...@bdurham.com>:
>> Python 2.6: Is there a built-in way to check if a Unicode string has
>> non-ASCII chars without having to check each char in the string?
>>
>> Here's my use case: I have a section of code that makes frequent calls to
>> hasattr. The attribute name being tested is derived from incoming data which
>> at times can contain international content.

Bad idea. Use a dict; don't try to pretend that an object is a dict.
This isn't Javascript. Incidentally, inheriting from "dict" works,
and is quite useful.

class item(dict):
    ...

p = item()
p['abc'] = 1

Inheriting from dict wasn't possible in early versions of Python, which led
to a style of abusing objects as if they were dictionaries.

Also note that 1) spaces in attribute names can be troublesome, and
2) duplicating the name of a function or built-in attribute will
override it, usually leading to unwanted results.
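
A minimal sketch of the dict-based approach, assuming the incoming names can simply be used as keys. The class name Item and the sample keys are made up for illustration. A membership test on a dict gives a plain True/False for non-ASCII keys, since no encoding step is involved:

```python
# -*- coding: utf-8 -*-
# Hypothetical illustration: keep incoming names as dict keys, not attributes.

class Item(dict):
    """A dict subclass; extra methods could be added here."""
    pass

p = Item()
p[u"abc"] = 1
p[u"\u012cbc"] = 2                    # a non-ASCII key is stored like any other

has_ascii_key = u"abc" in p           # key that was stored
has_unicode_key = u"\u012cbc" in p    # non-ASCII key that was stored
has_missing_key = u"\u012c" in p      # key that was never stored
```

The `in` test replaces hasattr() here, so the attribute-name encoding problem should not arise at all.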

John Nagle

Michel Claveau - MVP

Aug 22, 2010, 3:07:04 AM

Hi!

Another way :

# -*- coding: utf-8 -*-

import unicodedata

def test_ascii(struni):
    strasc=unicodedata.normalize('NFD', struni).encode('ascii','replace')
    if len(struni)==len(strasc):
        return True
    else:
        return False

print test_ascii(u"abcde")
print test_ascii(u"abcdê")

@-salutations
--
Michel Claveau

John Machin

Aug 22, 2010, 6:57:04 AM

On Aug 22, 5:07 pm, "Michel Claveau - MVP" <enleverLesX_XX...@XmclavXeauX.com.invalid> wrote:
> Hi!
>
> Another way :
>
> # -*- coding: utf-8 -*-
>
> import unicodedata
>
> def test_ascii(struni):
>     strasc=unicodedata.normalize('NFD', struni).encode('ascii','replace')
>     if len(struni)==len(strasc):
>         return True
>     else:
>         return False
>
> print test_ascii(u"abcde")
> print test_ascii(u"abcdê")

-1

Try your code with u"abcd\xa1" ... it says it's ASCII.

Suggestions:
test_ascii = lambda s: len(s.encode('ascii', 'ignore')) == len(s)
or
test_ascii = lambda s: all(c < u'\x80' for c in s)
or
use try/except
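
The try/except option could be sketched as follows. This is only one way to write it: is_ascii is a made-up name, and hasattr_enhanced is the wrapper the original poster proposed, assumed here to treat an unencodable name as "attribute not present":

```python
def is_ascii(s):
    """Return True if the unicode string s contains only ASCII characters."""
    try:
        s.encode("ascii")        # strict mode raises on the first non-ASCII char
    except UnicodeEncodeError:
        return False
    return True


def hasattr_enhanced(obj, name):
    """Like hasattr(), but return False if the name cannot be encoded."""
    try:
        return hasattr(obj, name)
    except UnicodeEncodeError:
        return False
```

With this sketch, is_ascii(u"abcd\xa1") is False and hasattr_enhanced(obj, u"\u012c") gives False instead of raising.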

Also:
    if a == b:
        return True
    else:
        return False

is a horribly bloated way of writing
    return a == b

Michel Claveau - MVP

Aug 22, 2010, 11:10:26 AM

Re !

> Try your code with u"abcd\xa1" ... it says it's ASCII.

Ah? On my computer, it says "False"

@-salutations
--
MCi

John Machin

Aug 22, 2010, 6:13:57 PM

On Aug 23, 1:10 am, "Michel Claveau - MVP" <enleverLesX_XX...@XmclavXeauX.com.invalid> wrote:
> Re !
>
> > Try your code with u"abcd\xa1" ... it says it's ASCII.
>
> Ah? On my computer, it says "False"

Perhaps your computer has a problem. Mine does this with both Python
2.7 and Python 2.3 (which introduced the unicodedata.normalize
function):

>>> import unicodedata
>>> t1 = u"abcd\xa1"
>>> t2 = unicodedata.normalize('NFD', t1)
>>> t3 = t2.encode('ascii', 'replace')
>>> [t1, t2, t3]
[u'abcd\xa1', u'abcd\xa1', 'abcd?']
>>> map(len, _)
[5, 5, 5]
>>>
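
A short sketch of why the length comparison catches some characters and not others. The helper name nfd_lengths is made up for illustration. NFD splits u"\xea" (e with circumflex) into "e" plus the combining mark U+0302, so the string grows by one. u"\xa1" has no decomposition, so 'replace' just swaps it for "?" and the length stays the same:

```python
import unicodedata

def nfd_lengths(s):
    """Return (len of s, len after NFD, len after ASCII 'replace' encoding)."""
    decomposed = unicodedata.normalize("NFD", s)
    encoded = decomposed.encode("ascii", "replace")
    return len(s), len(decomposed), len(encoded)

with_accent = nfd_lengths(u"abcd\xea")   # decomposes, so the lengths differ
with_symbol = nfd_lengths(u"abcd\xa1")   # no decomposition, lengths stay equal
```

Here with_accent is (5, 6, 6) and with_symbol is (5, 5, 5), so a test based on comparing lengths reports the second string as ASCII.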